 visual story


KAHANI: Culturally-Nuanced Visual Storytelling Pipeline for Non-Western Cultures

arXiv.org Artificial Intelligence

Large Language Models (LLMs) and Text-To-Image (T2I) models have demonstrated the ability to generate compelling text and visual stories. However, their outputs are predominantly aligned with the sensibilities of the Global North, often resulting in an outsider's gaze on other cultures. As a result, non-Western communities have to put extra effort into generating culturally specific stories. To address this challenge, we developed a visual storytelling pipeline called KAHANI that generates culturally grounded visual stories for non-Western cultures. Our pipeline leverages the off-the-shelf models GPT-4 Turbo and Stable Diffusion XL (SDXL). By using Chain of Thought (CoT) and T2I prompting techniques, we capture the cultural context from the user's prompt and generate vivid descriptions of the characters and scene compositions. To evaluate the effectiveness of KAHANI, we conducted a comparative user study with ChatGPT-4 (with DALL-E 3) in which participants from different regions of India compared the cultural relevance of stories generated by the two tools. Results from the qualitative and quantitative analysis performed on the user study showed that KAHANI was able to capture and incorporate more Culturally Specific Items (CSIs) compared to ChatGPT-4. In terms of both its cultural competence and visual story generation quality, our pipeline outperformed ChatGPT-4 in 27 out of the 36 comparisons.


Metamorpheus: Interactive, Affective, and Creative Dream Narration Through Metaphorical Visual Storytelling

arXiv.org Artificial Intelligence

Human emotions are essentially molded by lived experiences, from which we construct personalised meaning. Engagement in such a meaning-making process has been practiced as an intervention in various psychotherapies to promote wellness. Nevertheless, support for recollecting and recounting lived experiences in everyday life remains underexplored in HCI. It also remains unknown how technologies such as generative AI models can facilitate the meaning-making process, and ultimately support affective mindfulness. In this paper we present Metamorpheus, an affective interface that engages users in creative visual storytelling of emotional experiences during dreams. Metamorpheus arranges the storyline based on a dream's emotional arc, and provokes self-reflection through the creation of metaphorical images and text depictions. The system provides metaphor suggestions, and generates visual metaphors and text depictions using generative AI models, while users can apply generations to recolour and re-arrange the interface to be visually affective. Our experience-centred evaluation shows that, by interacting with Metamorpheus, users can recall their dreams in vivid detail, through which they relive and reflect upon their experiences in a meaningful way.


Hands-on with Microsoft's Dall-E 2-based Bing Image Creator: It's good!

PCWorld

Today, Microsoft begins integrating AI art into its AI-powered Bing Chat chatbot with Bing Image Creator…and it's surprisingly good. Microsoft began previewing Image Creator last fall in select markets, and its generative AI art later became the foundation for Microsoft Designer, the excellent design application that also uses AI art to help create templates, flyers, and simple greeting cards. Today, Bing Image Creator will begin integrating with Bing Chat's textual chatbot, but also generate images at its own site, Bing.com/create. Put another way, that means that you'll be able to ask Bing's chatbot to create your own images from an integrated text prompt within Bing Chat, or else use the dedicated site. There's a third option, too: Use the new Edge Copilot sidebar within Microsoft Edge, which has been used for textual generation via AI.


Integrating Visuospatial, Linguistic and Commonsense Structure into Story Visualization

arXiv.org Artificial Intelligence

While much research has been done in text-to-image synthesis, little work has been done to explore the use of the linguistic structure of the input text. Such information is even more important for story visualization since its inputs have an explicit narrative structure that needs to be translated into an image sequence (or visual story). Prior work in this domain has shown that there is ample room for improvement in the generated image sequence in terms of visual quality, consistency and relevance. In this paper, we first explore the use of constituency parse trees using a Transformer-based recurrent architecture for encoding structured input. Second, we augment the structured input with commonsense information and study the impact of this external knowledge on the generation of visual stories. Third, we also incorporate visual structure via bounding boxes and dense captioning to provide feedback about the characters/objects in generated images within a dual learning setup. We show that off-the-shelf dense-captioning models trained on Visual Genome can improve the spatial structure of images from a different target domain without needing fine-tuning. We train the model end-to-end using intra-story contrastive loss (between words and image sub-regions) and show significant improvements in several metrics (and human evaluation) for multiple datasets. Finally, we provide an analysis of the linguistic and visuo-spatial information. Code and data: https://github.com/adymaharana/VLCStoryGan.


deepsing: Generating Sentiment-aware Visual Stories using Cross-modal Music Translation

arXiv.org Artificial Intelligence

In this paper we propose a deep learning method for performing attribute-based music-to-image translation. The proposed method is applied for synthesizing visual stories according to the sentiment expressed by songs. The generated images aim to induce the same feelings in viewers as the original song does, reinforcing the primary aim of music, i.e., communicating feelings. The process of music-to-image translation poses unique challenges, mainly due to the unstable mapping between the different modalities involved in this process. In this paper, we employ a trainable cross-modal translation method to overcome this limitation, leading to the first, to the best of our knowledge, deep learning method for generating sentiment-aware visual stories. Various aspects of the proposed method are extensively evaluated and discussed using different songs.


Visual Story Post-Editing

arXiv.org Artificial Intelligence

We introduce the first dataset for human edits of machine-generated visual stories and explore how these collected edits may be used for the visual story post-editing task. The dataset, VIST-Edit, includes 14,905 human-edited versions of 2,981 machine-generated visual stories. The stories were generated by two state-of-the-art visual storytelling models, each aligned to 5 human-edited versions. We establish baselines for the task, showing how a relatively small set of human edits can be leveraged to boost the performance of large visual storytelling models. We also discuss the weak correlation between automatic evaluation scores and human ratings, motivating the need for new automatic metrics.


Getty Images Is Using Artificial Intelligence to Help Newsrooms Choose Better Photos

#artificialintelligence

Getty Images is embracing artificial intelligence, starting with a way to help publishers pick photos. Today, the photo agency debuted a tool that uses AI to analyze a story and suggest photos that might go along with it depending on the text and content. The tool, called Panels, uses natural language processing--a term for how computers can learn to "read" human words, phrases and sentences--to then match a story based on keywords, images, captions and other criteria. Publishers also will then have access to custom filters and a self-improving algorithm to move around keywords or select images through a more human-driven process. Here's how it works: When someone enters in the URL for a story or copies and pastes in the text, Panels will analyze the words before suggesting people, places and things that appear in the story after weighing different options based on frequency and relevance.